ABSTRACT

The polyhedral model provides a powerful mathematical abstraction
to enable effective optimization of loop nests with respect to a
given optimization goal, e.g., exploiting parallelism. Unexploited
reduction properties are a frequent reason for polyhedral optimizers
to assume parallelism-prohibiting dependences. To our knowledge, no
polyhedral loop optimizer available in any production compiler
provides support for reductions. In this paper, we show that
leveraging the parallelism of reductions can lead to a significant
performance increase. We give a precise, dependence-based definition
of reductions and discuss ways to extend polyhedral optimization to
exploit the associativity and commutativity of reduction
computations. We have implemented a reduction-enabled scheduling
approach in the Polly polyhedral optimizer and evaluate it on the
standard Polybench 3.2 benchmark suite. We were able to detect and
model all 52 arithmetic reductions and achieve speedups of up to
2.21× on a quad-core machine by exploiting the multidimensional
reduction in the BiCG benchmark.

Categories and Subject Descriptors 

D.3.4 [Programming Languages]: Processors - Compilers, Optimization

General Terms 

Algorithms; Performance 

Keywords 

Compiler Optimization; Affine Scheduling; Reductions 

1. INTRODUCTION 

Over the last four decades various approaches were proposed to
tackle reductions: a computational idiom which prevents parallelism
due to loop carried data dependences. An often used definition for
reductions describes them as an associative and commutative
computation which reduces the dimensionality of a set of input data
[16]. A simple example is the array sum depicted in Figure 1a. The
input vector A is reduced to the scalar variable sum using the
associative and commutative operator +. In terms of data
dependences, the loop has to be computed sequentially because a read
of the variable sum in iteration i + 1 depends on the value written
in iteration i. However, the associativity and commutativity of the
reduction operator can be exploited to reorder, parallelize or
vectorize such reductions.

While reordering the reduction iterations is always a valid
transformation, executing reductions in a parallel context requires
additional "fix up". Static transformations often use privatization
as the fix up technique, as it works well with both small and large
parallel tasks. The idea of privatization is to duplicate the shared
memory locations for each instance running in parallel. Thus, each
parallel instance works on a private copy of a shared memory
location. Using the privatization scheme we can vectorize the array
sum example as shown in Figure 1b. For the shared variable sum, a
temporary array tmp_sum, with as many elements as there are vector
lanes, is introduced. Now the computation for each vector lane uses
one array element to accumulate intermediate results unaffected by
the computations of the other lanes. As the reduction computation is
now done in the temporary array instead of the original reduction
location, we finally need to accumulate all intermediate results
into the original reduction location. This way, users of the
variable sum will still see the overall sum of all array elements,
even though it was computed in partial sums first.

for (i = 0; i < 4 * N; i++)
  sum += A[i];

(a) Sequential array sum computation.

tmp_sum[4] = {0,0,0,0};
for (i = 0; i < 4 * N; i+=4)
  tmp_sum[0:3] += A[i:i+3];

sum += tmp_sum[0] + tmp_sum[1]
     + tmp_sum[2] + tmp_sum[3];

(b) Vectorized array sum computation. 

Figure 1: A canonical example of a single address reduction. 
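
The vector notation of Figure 1b is not plain C. As a minimal sketch of the same idea, the following code emulates the four vector lanes with an inner loop. The names sum_sequential, sum_privatized and LANES are chosen here for illustration and are not part of the original text. Integer data is assumed because floating-point addition is not strictly associative.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 4 /* assumed vector width, matching Figure 1b */

/* Sequential array sum, as in Figure 1a. */
long sum_sequential(const int *A, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += A[i];
    return sum;
}

/* Privatized array sum: one private accumulator per lane (Figure 1b).
 * n must be a multiple of LANES, mirroring the 4 * N trip count. */
long sum_privatized(const int *A, size_t n) {
    assert(n % LANES == 0);
    long tmp_sum[LANES] = {0, 0, 0, 0}; /* identity element of + */
    for (size_t i = 0; i < n; i += LANES)
        for (size_t lane = 0; lane < LANES; lane++) /* stands in for one vector add */
            tmp_sum[lane] += A[i + lane];
    long sum = 0;
    for (size_t lane = 0; lane < LANES; lane++) /* final accumulation */
        sum += tmp_sum[lane];
    return sum;
}
```

Both functions return the same value for any integer input whose length is a multiple of LANES; only the order in which the elements are added differs.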






Transformations as described above have been the main interest of
reduction handling approaches outside the polyhedral world.
Associativity and commutativity properties are used to extract and
parallelize the reduction loop [10] or to parallelize the reduction
computation with regards to an existing surrounding loop. While
prior work on reductions in the polyhedral model [22, 23] was
focused on systems of affine recurrences (SAREs), we look at the
problems a production compiler has to solve when we allow polyhedral
optimizations that exploit the reduction properties. To this end our
work supplements the polyhedral optimizer Polly, part of the LLVM
compiler framework, with awareness for the associativity and
commutativity of reduction computations. While we are still in the
process of upstreaming, most parts are already accessible in the
public code repository.

The contributions of this paper include: 


• A powerful algorithm to identify reduction dependences,
applicable whenever memory or value based dependence
information is available.

• A sound model to relax memory dependences with regards to
reductions and its use in reduction-enabled polyhedral
scheduling.

• A dependence based approach to identify vectorization and
parallelization opportunities in the presence of reductions.


The remainder of this paper is organized as follows: We give a short
introduction to the polyhedral model in Section 2. Thereafter, in
Section 3, our reduction detection is described. Section 4 discusses
the benefits and drawbacks of different reduction parallelization
schemes, including privatization. Afterwards, we present different
approaches to utilize the reduction properties in a polyhedral
optimizer in Section 5. In the end we evaluate our work (Section 6),
compare it to existing approaches (Section 7) and conclude with
possible extensions in Section 8.


2. THE POLYHEDRAL MODEL 

The main idea behind polyhedral loop nest optimizations is to
abstract from technical details of the target program. Information
relevant to the optimization goal is represented in a very powerful
mathematical model and the actual optimizations are well understood
transformations on this representation. In the context of
optimization for data locality or parallelism, the relevant
information is the iteration space of each statement, as well as the
data dependences between individual statement instances.


for (i = 0; i < NX; i++) {
R:  q[i] = 0;
    for (j = 0; j < NY; j++) {
S:    q[i] = q[i] + A[i][j] * p[j];
T:    s[j] = s[j] + r[i] * A[i][j];
    }
}

Figure 2: BiCG Sub Kernel of BiCGStab Linear Solver. 

Figure 2 shows an example program containing three statements R, S
and T in a loop nest of depth two. Figure 3 shows the polyhedral
representation of the individual iteration spaces for all
statements, as well as value-based data dependences between
individual instances thereof. R has a one-dimensional iteration
space, as it is nested in the i-loop only. Statements S and T have a
two-dimensional iteration space as they are nested in both the
i-loop and the j-loop. The axes in the figure correspond to the
respective loops. Single instances of each statement are depicted as
dots in the graph. Dependences between individual statement
instances are depicted as arrows: dashed ones for regular data
dependences and dotted ones for loop carried data dependences.

Figure 3: Polyhedral representation of statements R, S and T of the
BiCG Sub Kernel of Figure 2.

In the polyhedral model the iteration space of a statement Q is
represented as a multidimensional $\mathbb{Z}$-polytope
$\mathcal{I}_Q$, defined by affine constraints on the iteration
variables of loops surrounding the statement, as well as global
parameters. The latter are loop invariant expressions, for example
the upper bounds NX and NY of the loops in Figure 2. As a
consequence, the polyhedral model is only applicable to well
structured program parts with affine loop bounds and memory access
functions, so called Static Control Parts (SCoPs). While there are
different over-approximations to increase the applicability (e.g.,
by Benabderrahmane), we will assume that all restrictions of SCoPs
are fulfilled.

The dependences between two statements Q and T are also represented
as a multidimensional $\mathbb{Z}$-polytope known as the dependence
polytope $\mathcal{D}_{\langle Q,T \rangle}$. It contains a point
$\langle i_Q, i_T \rangle$ for every pair of instances
$i_Q \in \mathcal{I}_Q$ and $i_T \in \mathcal{I}_T$ for which the
latter depends on the former. To ease reading we will however omit
the index of the dependence polytopes and only argue about the set
of all dependences $\mathcal{D}$, defined as:

$$\mathcal{D} := \{ \langle i_Q, i_T \rangle \mid \exists\, Q, T \in SCoP : \langle i_Q, i_T \rangle \in \mathcal{D}_{\langle Q,T \rangle} \}$$

Later we will also distinguish all Write-After-Write (WAW or output)
dependences of $\mathcal{D}$ by writing $\mathcal{D}_{WAW}$.

A loop transformation in the polyhedral model is represented as an
affine function $\theta_Q$ for each statement Q. It is often called
scheduling or scattering function. This function translates a point
in the original iteration space $\mathcal{I}_Q$ of statement Q into
a new, transformed target space. One important legality criterion
for such a transformation is that data dependences need to be
respected: The execution of every instance of a source statement Q
of a dependence has to precede the execution of the corresponding
target statement T in the transformed space. Formulated differently:
the target iteration vector of the value producing instance of Q has
to be lexicographically smaller^1 than the target iteration vector
of the consuming instance of T:

$$\langle i_Q, i_T \rangle \in \mathcal{D} \implies \theta_Q(i_Q) \ll \theta_T(i_T) \qquad (1)$$

^1 To compare two vectors of different dimensionality, we simply
fill up the shorter vector with zeros at the end to match the
dimensionality of the larger one.

Multiple statements, or multiple instances of the same statement,
that are mapped to the same point in the target space can be
executed in parallel. However, implementations of polyhedral
schedulers usually generate scheduling functions with full rank,
thus $rank(dom(\theta)) = rank(img(\theta))$. The parallelism is
therefore not explicit in the scheduling function but is exposed
later when the polyhedral representation is converted to target
code.
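
The lexicographic comparison of time stamps, including the zero padding for vectors of different dimensionality, can be sketched as follows. The function name lex_smaller and the representation of time stamps as plain int arrays are assumptions made for this illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Returns true if vector a (length na) is lexicographically smaller than
 * vector b (length nb). The shorter vector is padded with trailing zeros. */
bool lex_smaller(const int *a, size_t na, const int *b, size_t nb) {
    assert((a != NULL || na == 0) && (b != NULL || nb == 0));
    size_t n = na > nb ? na : nb;
    for (size_t k = 0; k < n; k++) {
        int ak = k < na ? a[k] : 0;
        int bk = k < nb ? b[k] : 0;
        if (ak != bk)
            return ak < bk; /* first differing dimension decides */
    }
    return false; /* equal time stamps: neither precedes the other */
}
```

Two instances for which lex_smaller returns false in both directions share a time stamp, which is the case the text describes as eligible for parallel execution.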

There are two things that make the described model particularly
interesting for loop transformation: First, unlike classical
optimizers, a polyhedral optimizer does not only consider individual
statements, but instead individual dynamic instances of each
statement. This granularity leads to a far higher expressiveness.
Second, the combination of multiple classical loop transformations,
like for instance loop skewing, reversal, fusion, even tiling,
typically used as atoms in a sequence of transformations, can be
performed in one step by the scattering function. There is no need
to come up with and evaluate different, possibly equivalent or even
illegal combinations of transformations. Instead, linear
optimization is used to optimize the scattering function for every
individual statement with respect to an optimization goal.


3. DETECTING REDUCTIONS 

Pattern based approaches on source statements are limited in their
ability to find general reduction idioms. The two main restrictions
are the number of patterns in the compiler's reduction pattern
database and the sensitivity to the input code quality or
preprocessing steps. To become as independent as possible of source
code quality and canonicalization passes we replace the pattern
recognition by a simple, data flow like analysis. This analysis will
identify reduction-like computations within each polyhedral
statement. Such a computation is a potential candidate for a
reduction, thus it might be allowed to perform the computation in
any order or even in parallel. Afterwards, we utilize the polyhedral
dependence analysis to precisely identify all reduction dependences
[20] in a SCoP, hence to identify the actual reduction computations
from the set of possible candidate (reduction-like) computations.


l = load A[f(i_S, p)]
store x A[f(i_S, p)]
x = y ⊙ z

Figure 5: SSA-based language subset. 


The following discussion is restricted to the SSA-based language
subset (Insts) depicted in Figure 5. Our implementation however
handles all LLVM-IR instructions.

The binary operation is parametrized with $\odot$ and can be
instantiated with any arithmetic, bit-wise or logic binary operator.
To distinguish associative and commutative binary operators we use
$\oplus$ instead. The load instruction is applied to a memory
location. It evaluates to the current value x stored in the
corresponding memory location. The store instruction takes a value x
and writes it to the given memory location. In both cases the memory
location is described as A[f(i_S, p)], where A is a constant array
base pointer and $f(i_S, p)$ is an affine function with regards to
outer loop indices of the statement S ($i_S$) and parameters ($p$)
of the SCoP. The range of a memory instruction is defined as the
range of its affine access function:

$$ran(\text{store } x\ A[f(i_S, p)]) := ran(A[f(i_S, p)])$$
$$ran(\text{load } A[f(i_S, p)]) := ran(A[f(i_S, p)])$$
$$ran(A[f(i_S, p)]) := A + ran(f(i_S, p))$$

Note the absence of any kind of control flow producing or dependent
instructions ($\phi$ instructions or branches). This is a side
effect of the limited scope of the reduction detection analysis. It
is applied only to polyhedral statements, in our setting basic
blocks with exactly one store instruction. Furthermore, we assume
all loop carried values are communicated in memory. This setup is
equivalent to C source code statements without non-memory side
effects.


3.1 Reduction-like Computations 

Reduction-like computations are a generalization of the reduction
definition used e.g., by Jouvelot or Rauchwerger [21]. Their main
characteristic is an associative and commutative computation which
reduces a set of input values into reduction locations. Furthermore,
the input values, the control flow and any value that might escape
into a non-reduction location need to be independent of the
intermediate results of the reduction-like computation. The
difference between reduction-like computations and reductions known
in the literature is the restriction on other appearances of the
reduction location in the loop nest. We do not restrict syntactic
appearances of the reduction location base pointer as e.g.,
Rauchwerger does, but only accesses to the actual reduction location
in the same statement. This means a reduction-like computation on
A[i%2] is not invalidated by any occurrence of A[i%2 + 1] in the
same statement or any occurrence of A in another statement.

It is crucial to stress that we define reduction-like computations
for a single polyhedral statement containing only a single store.
Thus intermediate results of a reduction-like computation can only
escape if they are used in a different statement or outside the
SCoP. As we focus on memory reductions in a single statement we will
assume such outside uses invalidate a candidate computation from
being reduction-like. To this end we define the function:

hasOutsideUses : Insts → bool

that returns true if an instruction is used outside its statement.
In Section 3.3 we explain how the situation changes if multiple
statements are combined into compound statements in order to save
compile time.

Reconsider the array sum example in Figure 1. The reduction location
is the variable sum, a scalar variable or zero dimensional array.
However, we do not limit reduction-like computations to zero
dimensional reduction locations, instead we allow multidimensional
reduction locations, also called histogram reductions [18], as well.
The second example, Figure 2, shows two such multidimensional
reductions. The reduction locations are q[i] and s[j]. The first is
variant in the outer loop, the second in the inner loop.

To detect reduction-like computations we apply the detection
function $t_S$, shown in Figure 4, to the store in the polyhedral
statement S. The idea is to track the flow of loaded values through
computation up to the store. To this end, $t_S(I)$ for any
instruction I will assign each load a symbol that describes how the
value loaded by the load is used up to and by I. We will use
$\mathcal{R}_{op}$ to refer to the set of all
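$$
\begin{aligned}
t_S(l = \text{load } A[f(i_S, p)]) := \lambda l'.\ & \text{if } (l' \neq l) \text{ then } \bot \text{ else} \\
& \text{if } hasOutsideUses(l) \text{ then } \top \text{ else } \uparrow \\[4pt]
t_S(x = y \odot z) := \lambda l.\ & \text{if } (\{ (t_S(y))(l), (t_S(z))(l) \} = \{ \bot, \bot \}) \text{ then } \bot \text{ else} \\
& \text{if } \neg(isCommutative(\odot) \wedge isAssociative(\odot)) \text{ then } \top \text{ else} \\
& \text{if } hasOutsideUses(x) \text{ then } \top \text{ else} \\
& \text{if } (\{ (t_S(y))(l), (t_S(z))(l) \} = \{ \uparrow, \bot \}) \text{ then } \odot \text{ else} \\
& \text{if } (\{ (t_S(y))(l), (t_S(z))(l) \} = \{ \odot, \bot \}) \text{ then } \odot \text{ else } \top \\[4pt]
t_S(\text{store } x\ A[f(i_S, p)]) := \lambda l.\ & \text{if } x \notin Insts \text{ then } \bot \text{ else} \\
& \text{if } (ran(l) \cap ran(A[f(i_S, p)]) = \emptyset) \text{ then } \top \text{ else} \\
& \text{if } (\exists l' : l \neq l' \wedge ran(l') \cap ran(l) \cap ran(A[f(i_S, p)]) \neq \emptyset) \text{ then } \top \text{ else } (t_S(x))(l)
\end{aligned}
$$

Figure 4: Detection function for reduction-like computations:
$t_S : Insts \to (loads(S) \to \mathcal{R}_{op})$.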


four symbols. It includes the $\bot$ indicating that the load was
not used by the instruction, the $\uparrow$ to express that it was
only loaded but not yet used in any computation, and the $\top$
stating that the loaded value may have been used in a
non-associative or non-commutative computation. Additionally, the
$\oplus$ is used when the loaded value was exactly one input of a
chain of $\oplus$ operations. Note that only a load l that flows
with $\oplus$ into the store is a valid candidate for a
reduction-like computation and only if the load and the store access
(partially) the same memory. Furthermore, we forbid all other load
instructions in the statement to access the same memory as both l
and the store as that would again make the computation potentially
non-associative and non-commutative.
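
The rule for a binary operation can be illustrated with a small sketch that combines the symbols of the two operands for one fixed load. The enum and function names are hypothetical. As a simplifying assumption, a single SYM_OP value stands for "one input of a chain of associative and commutative operations", so the sketch does not check that all operations in a chain use the same operator.

```c
#include <assert.h>
#include <stdbool.h>

/* The four symbols tracked for one fixed load l (hypothetical names). */
typedef enum { SYM_BOT, SYM_LOADED, SYM_OP, SYM_TOP } Symbol;

/* True if the unordered pair {a, b} equals {x, y}. */
static bool pair_is(Symbol a, Symbol b, Symbol x, Symbol y) {
    return (a == x && b == y) || (a == y && b == x);
}

/* Symbol of x = y op z, given the symbols of y and z for the load l. */
Symbol combine_binop(Symbol y, Symbol z, bool assoc_and_comm,
                     bool has_outside_uses) {
    assert(y <= SYM_TOP && z <= SYM_TOP);
    if (y == SYM_BOT && z == SYM_BOT)
        return SYM_BOT;              /* l is not used by this instruction */
    if (!assoc_and_comm)
        return SYM_TOP;              /* operator rules out a reduction */
    if (has_outside_uses)
        return SYM_TOP;              /* intermediate result escapes */
    if (pair_is(y, z, SYM_LOADED, SYM_BOT))
        return SYM_OP;               /* l enters the chain here */
    if (pair_is(y, z, SYM_OP, SYM_BOT))
        return SYM_OP;               /* chain is extended */
    return SYM_TOP;                  /* l used on both sides or invalid */
}
```

For example, an operation whose two operands both stem from the same load (both SYM_LOADED) yields SYM_TOP, since the loaded value is then no longer exactly one input of the chain.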

If a valid load l was found, it is the unique load instruction
inside the statement S that accesses (partially) the same memory as
the store s and $(t_S(s))(l)$ is an associative and commutative
operation $\oplus$. We will refer to the quadruple
$(S, l, \oplus, s)$ as the reduction-like computation $R_c$ of S and
denote the set of all reduction-like computations in a SCoP as
$\mathcal{R}_c$.

It is worth noting that we explicitly allow the access functions of
the load and the store to be different, as for example shown in
Figure 6. In such cases a reduction can manifest only for certain
parameter valuations or, as shown, for certain valuations of outer
loop indices. Additionally, we could easily extend the definition to
allow non-affine but Presburger accesses or even over-approximated
non-affine accesses if they are pure. It is also worth noting that
our definition does not restrict the shape of the induced reduction
dependences.

for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    A[j] = A[j-i] + Mat[i][j];

Figure 6: Conditional reduction with different access func¬ 
tions. 

3.2 Reduction Dependences 

While the data flow analysis performed on all polyhedral statements
only marks reduction-like computations, we are actually interested
in reduction dependences [20]. These loop carried self dependences
start and end in two instances of the same reduction-like
computation and they inherit some properties of this computation.
Similar to the reduction-like computation, reduction dependences can
be considered to be "associative" and "commutative". The latter
allows a schedule to reorder the iterations participating in the
reduction-like computation while it can still be considered valid;
however, all non-reduction dependences still need to be fulfilled.

We split the set of all dependences $\mathcal{D}$ into the set of
reduction dependences $\mathcal{D}_\rho \subseteq \mathcal{D}$ and
the set of non-reduction dependences
$\mathcal{D}_\nu := \mathcal{D} \setminus \mathcal{D}_\rho$. Now we
can express the commutativity of a reduction dependence by extending
the causality condition given in Constraint 1 as follows:

$$\langle i_Q, i_T \rangle \in \mathcal{D}_\nu \implies \theta_Q(i_Q) \ll \theta_T(i_T) \qquad (2)$$

$$\langle i_Q, i_T \rangle \in \mathcal{D}_\rho \implies \theta_Q(i_Q) \neq \theta_T(i_T) \qquad (3)$$

Constraint 2 is the same as the original causality condition
(Constraint 1), except that we restrict the domain to non-reduction
dependences $\mathcal{D}_\nu$. For the remaining reduction
dependences $\mathcal{D}_\rho$, Constraint 3 states that the
schedule $\theta$ can reorder two iterations freely, as long as they
are not mapped to the same time stamp. However, relaxing the
causality condition for reduction dependences is only valid if
$\mathcal{D}$ contains all transitive reduction dependences. This is
for example the case if $\mathcal{D}$ is computed by a memory-based
dependence analysis. In case only value-based dependence analysis
was performed it is also sufficient to provide the missing
transitive reduction dependences, e.g., by recomputing them using a
memory-based dependence analysis.
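
To make the difference between the original causality condition (1) and the relaxed conditions (2) and (3) concrete, the following sketch checks a schedule against an explicit list of dependences. It assumes one-dimensional time stamps and a finite, enumerated set of iterations; the type Dep and the function schedule_is_legal are names invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    int src;           /* source iteration i_Q */
    int dst;           /* target iteration i_T */
    bool is_reduction; /* true if the dependence is a reduction dependence */
} Dep;

/* theta[i] is the (one-dimensional) time stamp of iteration i.
 * With relax_reductions == false every dependence must satisfy (1).
 * With relax_reductions == true reduction dependences only need
 * different time stamps (3), all others must still satisfy (2). */
bool schedule_is_legal(const Dep *deps, size_t ndeps, const int *theta,
                       bool relax_reductions) {
    assert(deps != NULL && theta != NULL);
    for (size_t k = 0; k < ndeps; k++) {
        int ts = theta[deps[k].src];
        int tt = theta[deps[k].dst];
        if (relax_reductions && deps[k].is_reduction) {
            if (ts == tt)
                return false; /* violates Constraint 3 */
        } else {
            if (!(ts < tt))
                return false; /* violates Constraint 1 or 2 */
        }
    }
    return true;
}
```

For an array sum over four iterations with all memory-based dependences marked as reduction dependences, a reversed schedule is rejected by the original condition but accepted by the relaxed one, while a schedule that maps two iterations to the same time stamp is rejected by both.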

Reconsider the BiCG kernel (Figure 2) and its non-transitive
(value-based) set of dependences $\mathcal{D}$ shown in Figure 3. If
we remove all reduction dependences $\mathcal{D}_\rho$ from
$\mathcal{D}$, the only constraints left involve statement R and the
iterations of statement S with j = 0. Consequently, there is no
reason not to schedule the other instances of statement S before
statement R.

To address the issue of only value-based dependences without
recomputing memory-based ones we use the transitive closure
$\mathcal{D}^+_{\rho,S}$ of the reduction dependences for a
statement S (Equation 4). As the transitive closure of a Presburger
relation is not always a Presburger relation we might have to use an
over-approximation to remain sound; however, Pugh and Wonnacott
describe how the transitive closure can also be computed precisely
for exact direction/distance vectors. They also argue in later work
that the transitive closure of value-based reduction dependences of
real programs can be computed in an easy and fast way.
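
The effect of the transitive closure can be illustrated on a small, explicitly enumerated dependence relation. The sketch below uses Warshall's algorithm on a boolean matrix, which is a stand-in chosen for illustration only: the relations discussed in the text are parametric Presburger relations, for which this finite algorithm is not applicable. N_INST and the function names are assumptions.

```c
#include <assert.h>
#include <stdbool.h>

#define N_INST 5 /* assumed number of statement instances */

/* In-place transitive closure (Warshall's algorithm) of a dependence
 * relation given as a matrix: rel[a][b] means b depends on a. */
void transitive_closure(bool rel[N_INST][N_INST]) {
    for (int k = 0; k < N_INST; k++)
        for (int a = 0; a < N_INST; a++)
            for (int b = 0; b < N_INST; b++)
                if (rel[a][k] && rel[k][b])
                    rel[a][b] = true;
}

/* Value-based reduction dependences of an array sum: i -> i + 1. */
void chain_dependences(bool rel[N_INST][N_INST]) {
    for (int a = 0; a < N_INST; a++)
        for (int b = 0; b < N_INST; b++)
            rel[a][b] = (b == a + 1);
}
```

Applied to the chain i -> i + 1, the closure relates every instance to all later instances, which matches the dependences a memory-based analysis would report for the array sum.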

If we now interpret $\mathcal{D}^+_{\rho,S}$ as a relation that maps
instances of a reduction statement S to all instances of S
transitively dependent, we can define privatization dependences
$\mathcal{D}_\tau$ (Equation 5). In simple terms, $\mathcal{D}_\tau$
will ensure that no non-reduction statement accessing the reduction
location can be scheduled in-between the reduction statement
instances by extending the dependences ending or starting from a
reduction access to all reduction access instances. This also means
that in case no memory locations are reused, e.g., after renaming
and array expansion was applied, the set of privatization
dependences will be empty.

$$\mathcal{D}^+_{\rho,S} := (\mathcal{D}_\rho \cap (\mathcal{I}_S \times \mathcal{I}_S))^+ \qquad (4)$$

$$
\begin{aligned}
\mathcal{D}_\tau := {} & \{ \langle i_T, \mathcal{D}^+_{\rho,S}(i_S) \rangle \mid \langle i_T, i_S \rangle \in \mathcal{D}_{\langle T,S \rangle} \} \\
\cup {} & \{ \langle \mathcal{D}^+_{\rho,S}(i_S), i_T \rangle \mid \langle i_S, i_T \rangle \in \mathcal{D}_{\langle S,T \rangle} \}
\end{aligned}
\qquad (5)
$$

Privatization dependences overestimate the dependences that manual
privatization of the reduction locations would cause. They are used
to create alternative causality constraints for the reduction
statements that enforce the initial order between the reduction-like
computation and any other statement accessing the reduction
locations. To make use of them we replace Constraint 2 by
Constraint 6.

$$\langle i_Q, i_T \rangle \in (\mathcal{D}_\nu \cup \mathcal{D}_\tau) \implies \theta_Q(i_Q) \ll \theta_T(i_T) \qquad (6)$$

If we now utilize the associativity of the reduction dependences we
can compute intermediate results in any order before we combine them
to the final result. As a consequence we can allow parallelization
of the reduction-like computation, thus omit Constraint 3, thereby
eliminating the reduction dependences $\mathcal{D}_\rho$ from the
causality condition of a schedule completely. However, parallel
execution of iterations connected by reduction dependences requires
special "treatment" of the accesses during the code generation as
described in Section 4.

The restriction on polyhedral statements, especially that they
contain at most one store instruction, eases the identification of
reduction dependences; they are equal to all loop carried
Write-After-Write self dependences over a statement with a
reduction-like computation^2. Thus, $\mathcal{D}_\rho$ can be
expressed as stated in Equation 7.

$$\mathcal{D}_\rho := \mathcal{D}_{WAW} \cap \{ \mathcal{I}_S \times \mathcal{I}_S \mid (S, l, \oplus, s) \in \mathcal{R}_c \} \qquad (7)$$

3.3 General Polyhedral Statements 

Practical polyhedral optimizers operate on different granularities
of polyhedral statements; a crucial factor for both compile time and
quality of the optimization. While Clan^3 operates on C statements,
Polly is based on basic blocks in the SSA-based intermediate
language of LLVM. The former eases not only reduction handling but
also offers more scheduling freedom. However, the latter can
accumulate the effects of multiple C statements in one basic block,
thus it can perform better with regards to compile time. Finding a
good granularity for a given program, e.g., when and where to split
an LLVM basic block in the Polly setting, is a research topic on its
own, but we do not want to limit our approach to one fixed
granularity. Therefore, we will now assume a polyhedral statement
can contain multiple store instructions, thus we allow arbitrary
basic blocks.

^2 In this restricted environment we could also use the
Read-After-Write (RAW) dependences instead of the WAW ones.
^3 http://icps.u-strasbg.fr/~bastoul/development/clan/

As a first consequence we have to check that intermediate values of
a reduction-like computation do not escape into non-reduction memory
locations. This happens if and only if intermediate values, and
therefore the reduction load, flow into multiple store instructions
of the statement S.
Additionally, other store instructions are not allowed to override
intermediate values of the reduction computation. Thus,
$(S, l, \oplus, s)$ can only be a reduction-like computation if for
all other store instructions s' in S:

$$(t_S(s'))(l) = \bot \;\wedge\; ran(s') \cap ran(s) \cap ran(l) = \emptyset$$

Furthermore, we cannot assume that all loop carried WAW self
dependences of a statement containing a reduction-like computation
are reduction dependences: other read and write accesses contained
in the statement could cause the same kind of dependences. However,
we are particularly interested in dependences caused by the load and
store instruction of a reduction-like computation $R_c$. To track
these accesses separately we can pretend they are contained in their
own statement $S_{R_c}$ that is executed at the same time as S (in
the original iteration space). This is only sound as long as no
other instruction in S accesses (partially) the same memory as the
load or the store, but this was already a restriction on
reduction-like computations. The definition of reduction dependences
(Equation 7) is finally changed to:

$$\mathcal{D}_\rho := \mathcal{D}_{WAW} \cap \{ \mathcal{I}_{S_{R_c}} \times \mathcal{I}_{S_{R_c}} \mid R_c \in \mathcal{R}_c \} \qquad (8)$$

It is important to note the increased complexity of the dependence
detection problem when we model reduction accesses separately.
However, our experiments in Section 6 show that the effect is (in
most cases) negligible. Furthermore, we want to stress that this
kind of separation is not equivalent to separating the reduction
access at the statement level, as we do not allow separate
scheduling functions for S and $S_{R_c}$. Similar to a fine-grained
granularity at the statement level, separation might be desirable in
some cases; however, it suffers from the same drawbacks.

4. PARALLEL EXECUTION 

When executing accesses to a reduction location x, p times in
parallel, it needs to be made sure that the read-modify-write cycle
on x happens atomically. While doing exactly that (performing atomic
read-modify-write operations) might be a viable solution in some
contexts [31], it is generally too expensive. The overhead of an
atomic operation easily outweighs the actual work for smaller tasks.
Additionally, the benefit of vectorization is lost for the
reduction, as atomic operations have to scalarize the computation
again. We will therefore focus our discussion and the evaluation on
privatization as it is generally well-suited for the task at hand
[18].

4.1 Privatizing Reductions 

Privatization means that every parallel context $c_i$, which might
be a thread or just a vector lane, depending on the kind of
parallelization, gets its own private location $x_i$ for x. In front
of the parallelized loop carrying a reduction dependence, p private
locations $x_1, \dots, x_p$ of x are allocated and initialized with
the identity element of the corresponding reduction operation
$\oplus$. Every parallel context $c_i$ now non-atomically, and thus
cheaply, modifies its very own location $x_i$. After the loop, but
before the first use of x, accumulation code needs to join all
locations into x again, thus:
$x := x \oplus x_1 \oplus \dots \oplus x_p$.
// (A) init
for (i = 0; i < NX; i++)
  // (B) init
  for (j = 0; j < NY; j++)
    // (C) init
    for (k = 0; k < NZ; k++)
      P[j] += Q[i][j] * R[j][k];
    // (C) aggregate
  // (B) aggregate
// (A) aggregate

Figure 7: Possible privatization locations (A-C) for the reduction
over P[j].


Such a privatization transformation is legal due to the properties
of a reduction operation. Every possible user of x sees the same
result after the final accumulation has been performed as it would
have seen before the transformation. Nevertheless we gained
parallelism which cannot be exploited without the reduction
properties. It might seem that the final accumulation of the
locations needs to be performed sequentially, but note that the
number of locations does not necessarily grow with the problem size
but instead only with the maximal number of parallel contexts.
Furthermore, accumulation can be done in logarithmic time by
parallelizing the accumulation correspondingly.
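
One way to obtain a logarithmic number of accumulation steps is a pairwise (tree-shaped) combination of the private locations. The sketch below assumes addition as the reduction operator and runs the rounds sequentially; the function name tree_accumulate is hypothetical. Within one round of the outer loop all combinations touch disjoint elements, so they could be distributed over the parallel contexts.

```c
#include <assert.h>
#include <stddef.h>

/* Combines p private values pairwise; after the call v[0] holds the sum
 * of all p values. The outer loop runs ceil(log2(p)) rounds, and the
 * additions inside one round are independent of each other. */
long tree_accumulate(long *v, size_t p) {
    assert(v != NULL && p > 0);
    for (size_t stride = 1; stride < p; stride *= 2)
        for (size_t i = 0; i + stride < p; i += 2 * stride)
            v[i] += v[i + stride];
    return v[0];
}
```

The result would then be combined with the original reduction location x once, as in the accumulation formula of Section 4.1.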

One positive aspect of using privatization to fix a broken reduction
dependence is that it is particularly well-suited for both ways of
parallelization usually performed in the polyhedral context: thread
parallelism and vectorization. For thread parallelism real private
locations of the reduction address are allocated; in case of
vectorization, a vector of suitable width is used.

As described, privatization creates "copies" of the reduction
location, one for each instance possibly executed in parallel. While
we can limit the number of private locations (this corresponds to
the maximal number of parallel contexts), we cannot generally bound
the number of reduction locations. Furthermore, the number of
necessary locations, as well as the number of times initialization
and aggregation is needed, varies with the placement of the
privatization code.

Consider the example in Figure 7. Different possibilities exist to
exploit reduction parallelism: using placement C for the
privatization, the k-loop could be executed in parallel and only p
private copies of the reduction location are necessary. There is no
benefit in choosing location B as we then need p × NY privatization
locations (we have NY different reduction locations modified by the
j-loop and p parallel contexts), but there is no gain in the amount
of parallelism (the j-loop is already parallel). Finally, choosing
location A for privatization might be worthwhile. We still only need
p × NY privatized values, but save aggregation overhead: While for
location C, p values are aggregated NX × NY times and for location
B, p × NY locations are aggregated NX times, for location A, p × NY
locations are aggregated only once. Furthermore, the i-loop can now
be parallelized.
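
Placement A can be sketched in C as follows. The concrete sizes, the number of parallel contexts P_CTX, the cyclic distribution of the i-loop and the function names are assumptions made for this illustration; the parallel contexts are simulated by a sequential loop over c, and integer data is used so that both versions produce identical results.

```c
#include <assert.h>
#include <string.h>

enum { NX = 4, NY = 3, NZ = 5, P_CTX = 2 }; /* assumed sizes and contexts */

/* Reference: the loop nest of Figure 7 executed sequentially. */
void reduce_sequential(long P[NY], int Q[NX][NY], int R[NY][NZ]) {
    for (int i = 0; i < NX; i++)
        for (int j = 0; j < NY; j++)
            for (int k = 0; k < NZ; k++)
                P[j] += (long)Q[i][j] * R[j][k];
}

/* Placement A: P_CTX x NY private values, initialized and aggregated once. */
void reduce_placement_a(long P[NY], int Q[NX][NY], int R[NY][NZ]) {
    long priv[P_CTX][NY];
    memset(priv, 0, sizeof priv);           /* (A) init: identity of + */
    for (int c = 0; c < P_CTX; c++)         /* each c could be a thread */
        for (int i = c; i < NX; i += P_CTX) /* cyclic share of the i-loop */
            for (int j = 0; j < NY; j++)
                for (int k = 0; k < NZ; k++)
                    priv[c][j] += (long)Q[i][j] * R[j][k];
    for (int c = 0; c < P_CTX; c++)         /* (A) aggregate, done once */
        for (int j = 0; j < NY; j++)
            P[j] += priv[c][j];
}
```

The private array holds P_CTX × NY values, matching the p × NY locations named in the text, and the aggregation loop runs exactly once after the whole loop nest.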

In general, a trade-off has to be made between memory consumption,
aggregation time and exploitable parallelism. Finding a good
placement however is difficult and needs to take the optimization
goal, the hardware and the workload size into account. Furthermore,
depending on the scheduling, the choices for privatization code
placement in the resulting code might be limited, which suggests
that the scheduler should be aware of the implications of a chosen
schedule with respect to the efficiency of necessary privatization.

In Section 6.1 we discuss the effect of different placement choices
for the BiCG benchmark shown in Figure 2.

5. MODELING REDUCTIONS 

As mentioned earlier, the set B of all dependences is
partitioned into the set Bp of reduction-induced dependences
and the set Bu of regular dependences. Reduction dependences
inherit properties similar to associativity and commutativity
from the reduction operator ⊕: the corresponding source and
target statement instances can be executed in any order,
provided ⊕ is a commutative operation, or in parallel, if
⊕ is at least associative. In order to exploit these
properties the polyhedral optimizer needs to be aware of them.
To this end we propose different scheduling and code
generation schemes.

Reduction-Enabled Code Generation

is a simple, non-invasive method to realize reductions
during code generation, thus without modification
of the polyhedral representation of the SCoP.

Reduction-Enabled Scheduling

exploits the properties of reductions in the polyhedral
representation. All reduction dependences are basically
ignored during scheduling, thereby increasing the
freedom of the scheduler.

Reduction-Aware Scheduling

is the representation of reductions and their realization
via privatization in the polyhedral optimization.
The scheduler decides when and where to make use of
reduction parallelism. However, non-trivial modifications
of the polyhedral representation and the current
state-of-the-art schedulers are necessary.

5.1 Reduction-Enabled Code Generation 

The reduction-enabled code generation is a simple,
non-invasive approach to exploit reduction parallelism. The only
changes needed to enable this technique are in the code
generation, thus the polyhedral representation is not modified.
So far, dimensions or loops are marked parallel if they do
not carry any dependences. With regard to reduction
dependences we can relax this condition: a non-parallel
dimension or loop can be marked parallel, provided we add
privatization code, if it only carries reduction dependences.
To implement this technique we add one additional check to
the code generation that is executed for each non-parallel
loop of the resulting code that we want to parallelize. It
uses only the non-reduction dependences Bu, not B, to
determine if the loop exclusively carries reduction dependences.
If so, the reduction locations corresponding to the broken
dependences are privatized and the loop is parallelized.
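
The additional check can be sketched as follows. This is a simplification under stated assumptions, not the implementation in Polly: dependences are enumerated explicitly as pairs of iteration vectors of equal length instead of being represented as polyhedral relations, and the names `carried_at` and `classify_loop` are hypothetical.

```python
def carried_at(dep, depth):
    """True if the dependence (src, dst) is carried by the loop at `depth`."""
    src, dst = dep
    return src[:depth] == dst[:depth] and src[depth] < dst[depth]


def classify_loop(depth, regular_deps, reduction_deps):
    """Return 'sequential', 'reduction-parallel' or 'parallel' for one loop."""
    if any(carried_at(dep, depth) for dep in regular_deps):
        return "sequential"
    if any(carried_at(dep, depth) for dep in reduction_deps):
        return "reduction-parallel"   # parallel only with privatization code
    return "parallel"
```

For a two-dimensional loop nest that accumulates into `s[j]`, the reduction dependences lead from iteration (i, j) to (i + 1, j). The outer loop is then classified as reduction-parallel and the inner loop as parallel, as long as no regular dependence is carried.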

Due to its simplicity, this approach is easy to integrate
into existing optimizers, and the compile time overhead is
reasonably low. However, additional heuristics are needed:
first, to decide whether reductions should be realized at all,
e.g., whether privatization of a whole array is worth the gain
in parallelism; and second, to decide where the privatization
statements should be placed (cf. Section 4.1). Note that the
code generator usually has no knowledge of the optimization
goal of the scheduler, and in fact should not have any.

Apart from the need for heuristics, reduction-enabled code
generation also misses opportunities to realize reductions
effectively. This might happen if the scheduler has no reason
to perform an enabling transformation or if the applied
transformation even disabled the reduction. Either way, it is
hard to predict the outcome of this approach.


5.2 Reduction-Enabled Scheduling 

In contrast to reduction-enabled code generation, which is
basically a post-processing step, reduction-enabled scheduling
actually influences the scheduling process by eliminating
reduction dependences beforehand. Therefore, the
scheduler (1) is unaware of the existence of reductions and
their dependences and (2) has more freedom to schedule
statements if they contain reduction instances. While this
technique allows reductions to be exploited more aggressively,
there are still disadvantages. First of all, this approach relies
on reduction-enabled code generation as a back end, hence it
shares the same problems. Second, the scheduler’s unawareness
of reduction dependences prevents it from associating
costs with reduction realization. Thus, privatization is
implicitly assumed to come for free. Consequently, the scheduler
does not prefer existing, reduction-independent parallelism
over reduction parallelism and therefore may require
unnecessary privatization code.

For the BiCG example (see the figure), omitting the reduction
dependences might not result in the desired schedule if we
assume that we are only interested in one level of outermost
parallelism and, furthermore, that the statements S and T have
been split prior to scheduling. In this situation we want
to interchange the outer two loops for the T statement in
order to utilize the inherent parallelism, not the reduction
parallelism. However, without the reduction dependences
the scheduler will not perform this transformation. In order
to decrease the severity of this problem, the reduction
dependences can still be used in the proximity constraints of
the scheduler [29]; thus the scheduler will try to minimize
the dependence distance between reduction iterations and
implicitly move them to inner dimensions. This solves the
problem for all Polybench benchmarks with regard to
outermost parallelism; however, it might negatively affect
vectorization if, e.g., the innermost parallel dimension is always
vectorized.


5.3 Reduction-Aware Scheduling 

Reduction-enabled scheduling results in generally good
schedules for our benchmark set; however, resource constraints
as well as environment effects, both crucial to the overall
performance, are not represented in the typical objective
function used by polyhedral optimizers. In essence, we believe
the scheduler should be aware of reductions and the cost of
their privatization, in terms of memory overhead as well as
aggregation costs. This is especially true if the scheduler
is used to decide which dimensions should be executed in
parallel or if there are tight memory bounds (e.g., on mobile
devices).

In Section 6.1 we show that the execution environment as
well as the values of runtime parameters are crucial factors in
the actual performance of parallelized code, even more so when
reductions are involved. While a reduction-aware scheduler
could propose different parallelization schemes for different
execution environments or parameter values, more work is
needed in order to (1) predict the effects of parallelization
and privatization on the actual platform and to (2) express
them as affine constraints in the scheduling objective
function.

Parallel    Input 1        Input 2        Input 3        Input 4
version     32-c   4-c     32-c   4-c     32-c   4-c     32-c   4-c
Outer       0.19   0.55    2.31   0.75    3.91   0.72    2.19   0.96
Tile        0.03   1.10    0.32   1.54    0.10   1.60    0.16   2.21

Table 1: BiCG run-time results for four input sizes. The values
are speedups compared to the sequential Polly version, first for
the 32-core machine (32-c), then for the 4-core machine (4-c).

Footnote to Section 5.2 (one level of outermost parallelism): a
reasonable assumption for desktop computers or moderate
servers with a low number of parallel compute resources.


6. EVALUATION 

We implemented reduction-enabled scheduling (cf.
Section 5.2) in the polyhedral optimizer Polly and evaluated the
effects on compile time and run time on Polybench 3.2.
We used an Intel(R) Core i7-4800MQ quad-core machine
and the standard input size of the benchmarks.

Our approach is capable of identifying and modeling all
reductions as described earlier: in total, 52 arithmetic
reductions in 30 benchmarks.

As described earlier, our detection virtually splits
polyhedral statements to track the effects of the load and store
instructions that participate in reduction-like computations.
As this increases the complexity of the performed dependence
analysis, we timed this particular part of the compilation
for each of the benchmarks and compared our hybrid
dependence analysis to a completely access-wise analysis
and the default statement-wise one. We use the term hybrid
because reduction accesses are tracked separately while
other accesses are accumulated on statement level.
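
The difference between the three granularities can be sketched by counting the units the dependence analysis has to track. This is only an illustration under assumptions: each statement is described by a hypothetical list of flags, one per memory access, and we assume that in the hybrid mode every reduction access forms its own unit while all remaining accesses of a statement share one unit.

```python
def analysis_units(statements, mode):
    """Count the units tracked by the dependence analysis.

    `statements` maps a statement name to a list of booleans, one per memory
    access; True marks an access that takes part in a reduction-like computation.
    """
    units = 0
    for is_reduction_access in statements.values():
        if mode == "statement":
            units += 1
        elif mode == "access":
            units += len(is_reduction_access)
        elif mode == "hybrid":
            reduction = sum(is_reduction_access)
            others = len(is_reduction_access) - reduction
            units += reduction + (1 if others else 0)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return units
```

For a statement with two reduction accesses and two other accesses plus a second statement with three ordinary accesses, the statement-wise analysis tracks 2 units, the hybrid analysis 4 and the access-wise analysis 7.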

As shown in Figure 8 (top), our approach takes up to 5×
as long (benchmark lu) as the default implementation, but
on average only 85% more. Access-wise dependence
computation, however, is up to 10× slower than the default and
takes on average twice as long as our hybrid approach. Note
that both approaches not only compute the dependences
(partially) on the access level but also the reduction and
privatization dependences, as explained in Section 3.2.

Figure 8 (bottom) shows the speedup of our approach
compared to Polly without reduction support. The additional
scheduling freedom causes speedups for the data-mining
applications (correlation and covariance) but slowdowns especially
for the matrix multiplication kernels (2mm, 3mm and gemm).
This is due to the way Polly generates vector code. The
deepest dimension of the new schedule that is parallel (or
now reduction parallel) will be strip-mined and vectorized.
Hence the stride of the contained accesses, crucial to
generate efficient vector code, is not considered. However, we do
not believe this to be a general shortcoming of our approach,
as there are existing approaches to tackle the problem of
finding a good vector dimension [13] that would benefit from
the additional scheduling freedom as well as the knowledge
of reduction dependences.
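
Why the choice of the vector dimension matters can be illustrated with a small sketch. It assumes row-major arrays and a matrix multiplication statement of the form `C[i][j] += A[i][k] * B[k][j]`; the helper names are hypothetical, and the sketch is not a description of the measured kernels.

```python
def row_major_strides(shape):
    """Element stride of each array dimension in row-major layout."""
    strides = []
    step = 1
    for extent in reversed(shape):
        strides.append(step)
        step *= extent
    return list(reversed(strides))


def stride(access, loop, shape):
    """Stride of an access when `loop` advances by one iteration.

    `access` names, per array dimension, the loop index used as subscript.
    """
    return sum(s for idx, s in zip(access, row_major_strides(shape)) if idx == loop)


def deepest_parallel(parallel_flags):
    """Vector dimension selected by the 'deepest parallel loop' rule."""
    return max(d for d, flag in enumerate(parallel_flags) if flag)
```

Without reduction support only the i- and j-loop are parallel, so the j-loop is vectorized and `B[k][j]` is accessed with stride 1. Once the k-loop counts as reduction parallel, it becomes the deepest parallel loop, and `B[k][j]` is accessed with a stride of one full row.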


6.1 BiCG Case Study 

Polybench is a collection of inherently parallel programs;
there is only one, the BiCG kernel, that depends on
reduction parallelism. To study the effects of parallelization
combined with privatization of multidimensional reductions
in the BiCG kernel, we compared two parallel versions to
the non-parallel code Polly would generate without
reduction support. The first version, “Outer”, has a parallel
outermost loop and therefore needs to privatize the whole array
s. The second version, “Tile”, parallelizes the second
outermost loop. Due to tiling, only “tile size” (here 32)
locations of the q array need to be privatized. Table 1 shows
the speedup compared to the sequential version for both a
quad-core machine and an 8 × 4-core server. As the input
grows larger, the threading overhead as well as the
inter-chip communication on the server cause the speedup of
Tile to stagnate; on a one-chip architecture, however, this
version generally performs best. Outer, on the other hand,
performs well on the server but not on the 4-core machine.
We therefore believe the environment is a key factor
in the performance of reduction-aware parallelization, and a
reduction-aware scheduler is needed to decide under which
run-time conditions privatization becomes beneficial.

Footnote to the reduction count in Section 6: this assumes the
benchmarks are compiled with -ffast-math; otherwise reductions
over floating point computations are not detected.
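
The “Outer” variant can be sketched as follows. The kernel body is our recollection of the Polybench BiCG loop nest and should be read as an assumption, as should the function names; parallel contexts are simulated sequentially, and the vector argument `p` of the kernel is unrelated to the number of parallel contexts, which is called `num_contexts` here.

```python
def bicg_sequential(A, r, p):
    """Reference loop nest: s accumulates over i, q accumulates over j."""
    nx, ny = len(A), len(A[0])
    s = [0] * ny
    q = [0] * nx
    for i in range(nx):
        for j in range(ny):
            s[j] += r[i] * A[i][j]
            q[i] += A[i][j] * p[j]
    return s, q


def bicg_outer(A, r, p, num_contexts):
    """'Outer' variant: the i-loop is distributed, the whole array s is privatized."""
    nx, ny = len(A), len(A[0])
    q = [0] * nx
    private_s = [[0] * ny for _ in range(num_contexts)]
    for ctx in range(num_contexts):              # each context could run in parallel
        for i in range(ctx, nx, num_contexts):
            for j in range(ny):
                private_s[ctx][j] += r[i] * A[i][j]
                q[i] += A[i][j] * p[j]           # q[i] is touched by one context only
    # Aggregation, executed once: num_contexts * ny private values are combined.
    s = [sum(column) for column in zip(*private_s)]
    return s, q
```

With integer inputs both functions return identical results; the privatized version pays for the parallel outer loop with `num_contexts * ny` additional locations.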


7. RELATED WORK 

Reduction-aware loop parallelization has been a
long-lasting research topic. Different approaches to detect
reductions, to model them and finally to optimize them have been
proposed. As our work has some intersection with all three
parts, we discuss them separately.


7.1 Detection 

Reduction detection started with pattern-based approaches
on source statements [21, 22, 24] and evolved to more
elaborate techniques that use symbolic evaluation,
a data dependency graph or even a program dependency
graph to find candidates for reduction computations.

For functional programs, Xu et al. [30] use a type system
to deduce parallel loops including pattern-based reductions.
Their typing rules are similar to the detection function
(see the figure) we use to identify reduction-like computations.

Sato and Iwasaki [25] describe a pragmatic system to
detect and parallelize reduction and scan operations based on
the ideas introduced by Matsuzaki et al. [15]: the
representation of (part of) the loop as a matrix multiplication with
a state vector. They can handle mutually recursive scan
and reduction operations as well as maximum computations
implemented with conditionals, but they are restricted to
innermost loops and scalar accumulation variables. As an
extension, Zou and Rajopadhye [32] combined the work with
the polyhedral model and the recurrence detection approach
of Redon and Feautrier [22, 24]. This combination
overcomes many limitations, e.g., multidimensional reductions
(and scans) over arrays are handled. However, the
applicability is still restricted to scans and reductions representable
in State Vector Update Form.

In our setting we identify actual reductions utilizing the
already present dependence analysis, an approach very
similar to what Suganuma et al. [27] proposed. However,
we only perform the expensive, access-wise dependence
analysis for reduction candidates, and not for all accesses in
the SCoP. Nevertheless, both detections do not need the
reductions to be isolated in a separate loop, as assumed by
Fisher and Ghuloum [6] or Pottenger and Eigenmann [18].
Furthermore, we allow the induced reduction dependences
to be of any form and carried by any subset of outer loop
dimensions. This is similar to the nested Recur operator
introduced by Redon and Feautrier [22, 24]. Hence, reductions
are not restricted to a single loop dimension, as in other
approaches [11], but can also be multidimensional, as
shown in the figure.
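
The matrix-multiplication view mentioned above can be illustrated with a minimal sketch. It is our own simplified encoding, not the system of Sato and Iwasaki: each iteration `s = s + a[i]` is assumed to be written as a 2×2 matrix acting on the state vector (s, 1), and the matrices are combined pairwise, which is possible because matrix multiplication is associative.

```python
def mat_mul(x, y):
    """Product of two 2x2 matrices given as nested tuples."""
    return tuple(
        tuple(sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def loop_as_matrices(a):
    """Encode each iteration `s = s + a[i]` as a matrix acting on (s, 1)."""
    return [((1, v), (0, 1)) for v in a]


def reduce_tree(mats):
    """Combine the matrices pairwise; every level could be executed in parallel."""
    if not mats:
        return ((1, 0), (0, 1))
    while len(mats) > 1:
        # The later iteration is the left factor of each product.
        paired = [mat_mul(mats[i + 1], mats[i]) for i in range(0, len(mats) - 1, 2)]
        if len(mats) % 2:
            paired.append(mats[-1])
        mats = paired
    return mats[0]
```

Applying the combined matrix to the initial state (0, 1) yields the sum in its upper right entry, e.g. 15 for the input [1, 2, 3, 4, 5].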


7.2 Modeling 

Modeling reductions was commonly done implicitly, e.g.,
by ignoring the reduction dependences during a
post-parallelization step [17, 18, 21, 22, 28, 30]. This is comparable
to the reduction-enabled code generation described in
Section 5.1. However, we believe the full potential of reductions
can only be exposed when the effects are properly modeled
on the dependence level.

The first to do so, namely to introduce reduction
dependences, were Pugh and Wonnacott [20]. Similar to most
other approaches [22, 24], the detection and modeling of
the reduction was performed only on C-like statements,
utilizing a precise but costly access-wise dependence
analysis (see the upper part of Figure 8). In their work
they utilize both memory-based and value-based
dependence information to identify statements with an iteration
space that can be executed in parallel, possibly after
transformations like array expansion. They start with the
memory-based dependences and compute the value-based
dependences as well as the transitive self-dependence
relation for a statement in case the statement might not be
inherently sequential.

Stock et al. [26] describe how reduction properties can be
exploited in the polyhedral model; however, they describe
neither the detection nor how omitting reduction
dependences may affect other statements.

In the works of Redon and Feautrier [23] as well as the
extension to that by Gupta et al. [9], the reduction
modeling is performed on SAREs on which array expansion [4]
and renaming were applied, thus all dependences caused by
memory reuse were eliminated. In this setting the possible
interference between reduction computation and other
statements is simplified, but it might not be practical for
general-purpose compilers due to memory constraints. As an
extension to these scheduling approaches on SAREs we introduced
privatization dependences. They model the dependences
between a reduction and the surrounding statements without
the need for any special preprocessing of the input.
However, we still allow polyhedral optimizations that
affect not only the reduction statement but all statements in a
SCoP.
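
One way to read the restriction expressed by privatization dependences is as a check on a given execution order. The following sketch assumes the formulation used in our conclusions, namely that no other access to the reduction location is scheduled between the first and the last instance of the reduction statement. The explicit list representation of a schedule and the function name are hypothetical.

```python
def respects_reduction_integrity(schedule, reduction_stmt, location):
    """Check that no foreign access to `location` interrupts the reduction.

    `schedule` is a list of (statement, accessed_locations) in execution order.
    """
    positions = [k for k, (stmt, _) in enumerate(schedule) if stmt == reduction_stmt]
    if not positions:
        return True
    first, last = positions[0], positions[-1]
    return all(
        stmt == reduction_stmt or location not in accessed
        for stmt, accessed in schedule[first:last + 1]
    )
```

Statements that do not touch the reduction location may be interleaved freely with the reduction instances, whereas a use of the location in between violates the check.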


7.3 Optimization 

Optimization in the context of reductions is twofold. There
is the parallelization of the reduction as it is given in the
input, and there is the transformation as well as possible
parallelization of the input with awareness of the reduction
properties. The first idea is very similar to the
reduction-enabled code generation as described in Section 5.1. In
different variations, innermost loops, loops containing only a
reduction [18] or recursive functions computing a reduction were
parallelized or replaced by a call to a possibly parallel
reduction implementation. The major drawback of such
optimizations is that reductions have to be computed either
in isolation or with the statements that are part of
the source loop that is parallelized. Thus, the reduction
statement instances are never reordered or interleaved with
other statement instances, even if it would be beneficial. In
order to allow powerful transformations in the context of
reductions, their effect, hence the reduction dependences, as
well as their possible interactions with all other statement
instances must be known. The first polyhedral scheduling
approach, by Redon and Feautrier [23], which optimally
schedules reductions together with other statements, assumed
reductions to be computable in one time step. With such
atomic reduction computations there are no reduction
statement instances that could be reordered or interleaved with
other statement instances. Gupta et al. [9] extended that
work and lifted the restriction to atomic reduction
computations. As they schedule the instances of the reduction
computation together with the instances of all other
statements, their work can be seen as a reduction-enabled
scheduler that optimally minimizes the latency of the input.

To speed up parallel execution of reductions, the runtime
overhead needs to be minimized. Pottenger [18] proposed to
privatize the reduction locations instead of locking them for
each access, and Suganuma et al. [27] described how multiple
reductions on the same memory location can be coalesced.
For the case that dynamic reduction detection [21] is performed,
different privatization schemes to minimize the memory and
runtime overhead were proposed by Yu et al. [31]. While the
latter is out of scope for a static polyhedral optimizer, the
former might be worth investigating once our approach is
extended to multiple reductions on the same location.

In contrast to polyhedral optimization or parallelization,
Gautam and Rajopadhye [7] exploited reduction properties
in the polyhedral model to decrease the complexity of a
computation in the spirit of dynamic programming. Their work
on reusing shared intermediate results of reduction
computations is completely orthogonal to ours.

While array expansion, as introduced by Feautrier [4], is
not a reduction optimization, it is still similar to the
privatization step of any reduction handling approach. However,
the number of privatization copies the approach introduces,
the accumulation of these private copies as well as the kind
of dependences that are removed differ. While privatization
only introduces a new location for each processor or vector
lane, general array expansion introduces a new location for
each instance of the statement. In terms of dependences,
array expansion aims to remove false output and anti
dependences that are introduced by the reuse of memory, while
reduction handling approaches break output and flow
dependences that are caused by a reduction computation. Because
of the flow dependences (the actual reuse of formerly
computed values), the reduction handling approaches also need
to implement a more elaborate accumulation scheme that
combines all private copies again.
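
The contrast can be made concrete for an array sum. The sketch below is an illustration under assumptions, with hypothetical function names: the expanded version gives every statement instance its own location but keeps the flow dependence on the previous instance, while the privatized version uses one location per (simulated) processor and needs a final accumulation.

```python
def expanded_sum(a):
    """Array expansion: every statement instance writes its own location."""
    s = [0] * (len(a) + 1)          # s[0] holds the initial value
    for i, v in enumerate(a):
        s[i + 1] = s[i] + v         # still reads the value of the previous instance
    return s[-1], len(a)            # result, locations introduced (one per instance)


def private_copy_sum(a, p):
    """Privatization: one location per processor plus a final accumulation."""
    private = [0] * p
    for i, v in enumerate(a):
        private[i % p] += v         # no dependence between different processors
    return sum(private), p          # result, locations introduced
```

For four input values and two processors, expansion introduces four locations and privatization two; only the privatized loop has iterations that are independent across processors.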


8. CONCLUSIONS AND FUTURE WORK

Earlier work already utilized reduction dependences in
different varieties, depending on how powerful the detection
was. Whenever reductions have been parallelized, the
reduction dependences have been implicitly ignored; in at least
two cases they have even been made explicit [20, 26].
However, to our knowledge, we are the first to add the concept
of privatization dependences in this context. The reason is
simple: we believe the parallel execution of a loop
containing a reduction is not always the best possible optimization.
Instead we want to allow any transformation possible to our
scheduler with only one restriction: the integrity of the
reduction computation needs to stay intact. In other words,
no access to the reduction location is scheduled between the
first and last instance of the reduction statement. This
allows our scheduler not only to optimize the reduction
statement in isolation, but also to consider other statements at
the same time without the need for any preprocessing to get
a SARE-like input.

To this end we presented a powerful reduction detection
based on computation properties and the polyhedral
dependence analysis. Our design leverages the power of
polyhedral loop transformations and exposes various optimization
possibilities including parallelism in the presence of
reduction dependences. We showed how to model and leverage
associativity and commutativity to relax the causality
constraints and proposed three approaches to make polyhedral
loop optimization reduction-aware. We believe our
framework is the first step to handle various well-known idioms,
e.g., privatization or recurrences, not yet exploited in most
practical polyhedral optimizers.

Furthermore, we showed that problems and opportunities
arising from reduction parallelism (see Section 6.1) have
to be incorporated into the scheduling process, thus the
scheduling in the polyhedral model needs to be done in a
more realistic way. The overhead of privatization and the
actual gain of parallelism are severely influenced by the
execution environment (e.g., available resources, number of
processors and cores, cache hierarchy); however, these
hardware-specific parameters are often not considered in a
realistic way during the scheduling process.

Extensions to this work include a working reduction-aware
scheduler and the modeling of multiple reduction-like
computations as well as other parallelization-preventing idioms.
In addition, we believe that a survey about the applicability
of different reduction detection schemes as well as
optimization approaches in a realistic environment is needed. In any
case this would help us to understand reductions not only
from the theoretical point of view but also from a practical
one.


Footnote to “optimally” in Section 7.3: e.g., according to the latency.


10. REFERENCES 

[1] M.-W. Benabderrahmane, L.-N. Pouchet, A. Cohen,
and C. Bastoul. The polyhedral model is more widely
applicable than you think. In Proceedings of the 19th
Joint European Conference on Theory and Practice of
Software, International Conference on Compiler
Construction, CC’10/ETAPS’10, pages 283-303,
Berlin, Heidelberg, 2010. Springer-Verlag.

[2] G. E. Blelloch. Scans as primitive parallel operations. 
IEEE Trans. Comput., 38(11):1526-1538, Nov. 1989. 

[3] U. Bondhugula, M. M. Baskaran, S. Krishnamoorthy,
J. Ramanujam, A. Rountev, and P. Sadayappan.
Automatic transformations for
communication-minimized parallelization and locality
optimization in the polyhedral model. ETAPS ’08.

Figure 8: Evaluation results for Polybench 3.2. In the upper part the compile time for dependence analyses of different
granularity (access-wise, hybrid and statement-wise dependences) is shown, in the lower part the speedup of Polly with
reduction support compared to Polly without reduction support.

[4] P. Feautrier. Array expansion. In Proceedings of the
2nd International Conference on Supercomputing, ICS
’88, pages 429-441, New York, NY, USA, 1988. ACM.

[5] P. Feautrier. Dataflow analysis of array and scalar
references. International Journal of Parallel
Programming, 20(1):23-53, 1991.

[6] A. L. Fisher and A. M. Ghuloum. Parallelizing 
complex scans and reductions. In Proceedings of the 
ACM SIGPLAN 1994 Conference on Programming 
Language Design and Implementation, PLDI ’94, 
pages 135-146, New York, NY, USA, 1994. ACM. 

[7] Gautam and S. Rajopadhye. Simplifying reductions.
In Conference Record of the 33rd ACM
SIGPLAN-SIGACT Symposium on Principles of
Programming Languages, POPL ’06, pages 30-41, New
York, NY, USA, 2006. ACM.

[8] T. Grosser, A. Größlinger, and C. Lengauer. Polly -
performing polyhedral optimizations on a low-level
intermediate representation. Parallel Processing
Letters, 22(4), 2012.

[9] G. Gupta, S. Rajopadhye, and P. Quinton. Scheduling 
reductions on realistic machines. In Proceedings of the 
Fourteenth Annual ACM Symposium on Parallel 
Algorithms and Architectures, SPAA ’02, pages 
117-126, New York, NY, USA, 2002. ACM. 


[10] P. Jouvelot. Parallelization by semantic detection of
reductions. In Proc. of the European Symposium on
Programming on ESOP 86, pages 223-236, New York,
NY, USA, 1986. Springer-Verlag New York, Inc.

[11] P. Jouvelot and B. Dehbonei. A unified semantic
approach for the vectorization and parallelization of
generalized reductions. In Proceedings of the 3rd
International Conference on Supercomputing, ICS ’89,
pages 186-194, New York, NY, USA, 1989. ACM.

[12] P. M. Kogge and H. S. Stone. A parallel algorithm for
the efficient solution of a general class of recurrence
equations. IEEE Trans. Comput., 22(8):786-793, Aug.
1973.

[13] M. Kong, R. Veras, K. Stock, F. Franchetti, L.-N.
Pouchet, and P. Sadayappan. When polyhedral
transformations meet SIMD code generation. In
Proceedings of the 34th ACM SIGPLAN Conference
on Programming Language Design and
Implementation, PLDI ’13, pages 127-138, New York,
NY, USA, 2013. ACM.

[14] C. Lattner and V. Adve. LLVM: A compilation
framework for lifelong program analysis &
transformation. In Proceedings of the International
Symposium on Code Generation and Optimization:
Feedback-directed and Runtime Optimization, CGO
’04, pages 75-, Washington, DC, USA, 2004. IEEE
Computer Society.

[15] K. Matsuzaki, Z. Hu, and M. Takeichi. Towards
automatic parallelization of tree reductions in
dynamic programming. In Proceedings of the
Eighteenth Annual ACM Symposium on Parallelism in
Algorithms and Architectures, SPAA ’06, pages 39-48,
New York, NY, USA, 2006. ACM.

[16] S. P. Midkiff. Automatic Parallelization: An Overview 
of Fundamental Compiler Techniques. Synthesis 
Lectures on Computer Architecture. 2012. 

[17] S. S. Pinter and R. Y. Pinter. Program optimization
and parallelization using idioms. In Proceedings of the
18th ACM SIGPLAN-SIGACT Symposium on
Principles of Programming Languages, POPL ’91,
pages 79-92, New York, NY, USA, 1991. ACM.

[18] B. Pottenger and R. Eigenmann. Idiom recognition in
the Polaris parallelizing compiler. In Proceedings of the
9th International Conference on Supercomputing, ICS
’95, pages 444-448, New York, NY, USA, 1995. ACM.

[19] W. Pugh. Uniform techniques for loop optimization.
In Proceedings of the 5th International Conference on
Supercomputing, ICS ’91, pages 341-352, New York,
NY, USA, 1991. ACM.

[20] W. Pugh and D. Wonnacott. Static analysis of upper 
and lower bounds on dependences and parallelism. 
ACM Trans. Program. Lang. Syst., 16(4):1248-1278, 
July 1994. 

[21] L. Rauchwerger and D. Padua. The LRPD test:
Speculative run-time parallelization of loops with
privatization and reduction parallelization. In
Proceedings of the ACM SIGPLAN 1995 Conference
on Programming Language Design and
Implementation, PLDI ’95, pages 218-232, New York,
NY, USA, 1995. ACM.

[22] X. Redon and P. Feautrier. Detection of recurrences in
sequential programs with loops. In Proceedings of the
5th International PARLE Conference on Parallel
Architectures and Languages Europe, PARLE ’93,
pages 132-145, London, UK, UK, 1993.
Springer-Verlag.

[23] X. Redon and P. Feautrier. Scheduling reductions. In
Proceedings of the 8th International Conference on
Supercomputing, ICS ’94, pages 117-125, New York,
NY, USA, 1994. ACM.

[24] X. Redon and P. Feautrier. Detection of scans in the
polytope model. Parallel Algorithms Appl.,
15(3-4):229-263, 2000.

[25] S. Sato and H. Iwasaki. Automatic parallelization via
matrix multiplication. In Proceedings of the 32nd
ACM SIGPLAN Conference on Programming
Language Design and Implementation, PLDI ’11,
pages 470-479, New York, NY, USA, 2011. ACM.

[26] K. Stock, M. Kong, T. Grosser, L.-N. Pouchet,
F. Rastello, J. Ramanujam, and P. Sadayappan. A
framework for enhancing data reuse via associative
reordering. In Proceedings of the 35th ACM SIGPLAN
Conference on Programming Language Design and
Implementation, PLDI ’14, pages 65-76, New York,
NY, USA, 2014. ACM.

[27] T. Suganuma, H. Komatsu, and T. Nakatani.
Detection and global optimization of reduction
operations for distributed parallel machines. In
Proceedings of the 10th International Conference on
Supercomputing, ICS ’96, pages 18-25, New York, NY,
USA, 1996. ACM.


[28] A. Venkat, M. Shantharam, M. Hall, and M. M.
Strout. Non-affine extensions to polyhedral code
generation. In Proceedings of Annual IEEE/ACM
International Symposium on Code Generation and
Optimization, CGO ’14, pages 185:185-185:194, New
York, NY, USA, 2014. ACM.

[29] S. Verdoolaege. isl: An integer set library for the
polyhedral model. In Proceedings of the Third
International Congress Conference on Mathematical
Software, ICMS’10, pages 299-302, Berlin, Heidelberg,
2010. Springer-Verlag.

[30] D. N. Xu, S.-C. Khoo, and Z. Hu. Ptype system: A
featherweight parallelizability detector. In
Proceedings of 2nd Asian Symposium on Programming
Languages and Systems (APLAS 2004), LNCS 3302,
pages 197-212. Springer, LNCS, 2004.

[31] H. Yu, D. Zhang, and L. Rauchwerger. An adaptive
algorithm selection framework. In Proceedings of the
13th International Conference on Parallel
Architectures and Compilation Techniques, PACT ’04,
pages 278-289, Washington, DC, USA, 2004. IEEE
Computer Society.

[32] Y. Zou and S. Rajopadhye. Scan detection and
parallelization in “inherently sequential” nested loop
programs. In Proceedings of the Tenth International
Symposium on Code Generation and Optimization,
CGO ’12, pages 74-83, New York, NY, USA, 2012.
ACM.





